proton by gnurizen · Pull Request #15 · parca-dev/parcagpu

gnurizen · 2026-04-09T14:06:07Z

Rewrite parcagpu to use Proton's CUPTI infrastructure
Update proton callback API names for upstream sync
Update proton submodule to latest upstream sync
Replace interval-based rate limiter with token bucket algorithm
Add activity_batch USDT probe and fix test infrastructure
Various fixes to make arm64 work
And make amd64 compile too
Small cleanups/formatting
Shorten names
Checkpoint PC sampling tweaking
Stall reason map handling, prepping for batched pc samples
Flush out cubin processing, sass lookup and pc sampling probe batching
PC sampling: probabilistic windowed start/stop with KERNEL_SERIALIZED mode
Cleanup related to usdt/cupti extraction

Major changes:
- Use Proton as a git submodule for CUPTI callback handling
- Rewrite in C++ using Proton's CuptiApi and callback patterns
- Add PC sampling support for continuous GPU profiling
- Simplify build to single library (works with any CUDA version at runtime)
- Use CMake build system
- Consolidate GitHub workflows into single build.yml
- Update Dockerfile to Ubuntu 24.04 (fixes USDT probe generation)

The library now uses Proton's dynamic CUPTI loading, so a single build works with CUDA 12.x and 13.x at runtime.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

setDriverCallbacks renamed to setLaunchCallbacks in upstream proton.

The simple 500μs interval check could only pass 2000 samples/sec regardless of actual load. The token bucket (configurable via PARCAGPU_RATE_LIMIT, default 100/sec) smooths bursts while maintaining a predictable average rate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
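For readers unfamiliar with the technique, here is a minimal sketch of a token bucket in C++. It is not the code from this PR: the class name `TokenBucket`, the `burst` parameter, and the injectable clock are assumptions made for illustration, and the sketch is not thread-safe.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical token bucket, for illustration only (not the parcagpu code).
// Tokens refill continuously at `ratePerSec` up to `burst`; each event
// that passes consumes one token.
class TokenBucket {
public:
  using Clock = std::chrono::steady_clock;

  TokenBucket(double ratePerSec, double burst,
              Clock::time_point now = Clock::now())
      : rate_(ratePerSec), capacity_(burst), tokens_(burst), last_(now) {}

  // Returns true if one event may pass at time `now`.
  // Not thread-safe: callers would need their own locking.
  bool tryAcquire(Clock::time_point now = Clock::now()) {
    std::chrono::duration<double> elapsed = now - last_;
    last_ = now;
    // Refill for the elapsed time, capped at the bucket capacity.
    tokens_ = std::min(capacity_, tokens_ + elapsed.count() * rate_);
    if (tokens_ < 1.0) {
      return false;  // bucket empty: drop this event
    }
    tokens_ -= 1.0;
    return true;
  }

private:
  double rate_;      // tokens added per second
  double capacity_;  // maximum tokens held (burst size)
  double tokens_;    // tokens currently available
  Clock::time_point last_;
};
```

With a rate of 100/sec and a burst of 5, five back-to-back events pass, the sixth is dropped, and further events pass again once enough time has elapsed to refill a token. Unlike a fixed minimum interval between events, short bursts are allowed while the long-run average stays at the configured rate.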

Add parcagpuActivityBatch() probe that fires with batches of up to 128 activity record pointers, enabling BPF consumers to read kernel timing data directly from CUPTI buffers without per-record probe overhead.

Build/test changes:
- Link test binary against mock CUPTI/CUDA with --no-as-needed so Proton's dlopen(RTLD_NOLOAD) finds the mocks at runtime
- Fix make test to run the test binary directly with LD_LIBRARY_PATH (ctest had no tests registered)
- Add make bpf-test and make test-multi targets for BPF activity parser integration testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
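The batching idea can be sketched as follows. This is an illustration under assumptions, not the PR's code: `emitInBatches` and the `emit` callback are hypothetical names, with the callback standing in for the point where the real code would fire the USDT probe. Only the 128-pointer cap comes from the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Batch cap taken from the commit message.
constexpr std::size_t kMaxBatch = 128;

// Hypothetical helper: walks a list of activity record pointers and hands
// them to `emit` in chunks of at most kMaxBatch. `emit(ptrs, n)` stands in
// for firing the probe once per chunk. Returns the number of batches.
template <typename Emit>
std::size_t emitInBatches(const std::vector<const void*>& records,
                          Emit&& emit) {
  std::size_t batches = 0;
  for (std::size_t off = 0; off < records.size(); off += kMaxBatch) {
    const std::size_t n = std::min(kMaxBatch, records.size() - off);
    emit(records.data() + off, n);  // one probe firing per chunk
    ++batches;
  }
  return batches;
}
```

For 300 records this fires three times (128, 128, 44) instead of 300 times, which is where the saving in per-record probe overhead comes from.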

… mode

Implement interval-gated probabilistic PC sampling that only serializes kernels during active sampling windows, not for the entire process lifetime.

Architecture:
- CUPTI lifecycle: enable (once) → start/stop (per window) → disable (once)
- Enable START_STOP_CONTROL attribute so start/stop work from CUPTI callbacks
- Collection mode is KERNEL_SERIALIZED for per-kernel correlation
- Probabilistic window: every PARCAGPU_PC_SAMPLING_INTERVAL seconds, roll a PARCAGPU_PC_SAMPLING_PROBABILITY die; if it hits, start sampling until the window closes, then stop and drain data
- start()/stop() are mutex-guarded and idempotent (no double-start/stop races)
- ctxSynchronize before start to satisfy CUPTI's GPU-idle requirement

Key changes:
- pc_sampling.cpp: Session-based enable with per-window start/stop, semaphore-gated stall reason map replay (replaces rate-limited emission), CUPTI 12.4 ABI version check (v22 correlationId boundary), graceful permission failure handling in enable
- cupti.cpp: Probabilistic window state machine in ENTER/EXIT callbacks, env var config (probability, interval), env_config validation
- probes.d: Add error USDT probe for surfacing CUPTI failures to BPF
- test/mock_cupti.c: Full PC sampling mock with real cubin from pc_sample_toy, real SASS offsets for source-line correlation, 11-entry sample table cycling through shmem_bounce/hash_churn/trig_storm kernels
- test/mock_cuda.c: Add cuCtxSynchronize stub
- test/test-pc-mock.sh: New GPU-less test using mock libs and real cubin
- test/test-pc-real.sh: Set probability=1 interval=0.5 for reliable test hits
- test/bpf/: Move CUPTI struct defs to shared cupti_bpf.h, add error event handling, CUDA 12.4+ correlationId support
- test/CMakeLists.txt: Build mock CUDA driver library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
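The probabilistic window described above can be sketched roughly as follows. This is a simplified model, not the state machine in cupti.cpp: the class `SamplingWindow`, its method names, and the start/stop counters are hypothetical. The counters stand in for the real CUPTI start/stop calls, context synchronization, and data draining, all of which are omitted here.

```cpp
#include <mutex>
#include <random>

// Hypothetical model of interval-gated probabilistic sampling.
// Every `intervalSec` seconds a new window begins; a die roll against
// `probability` decides whether sampling is active for that window.
class SamplingWindow {
public:
  SamplingWindow(double probability, double intervalSec, unsigned seed)
      : probability_(probability), interval_(intervalSec), rng_(seed) {}

  // Called from a callback with the current time in seconds.
  // Returns true while sampling is active.
  bool onCallback(double nowSec) {
    std::lock_guard<std::mutex> lock(mu_);
    if (first_ || nowSec - windowStart_ >= interval_) {
      first_ = false;
      windowStart_ = nowSec;
      stopLocked();  // close the previous window, if any
      if (dist_(rng_) < probability_) {
        startLocked();  // the die hit: sample during this window
      }
    }
    return active_;
  }

  int starts() const { return starts_; }
  int stops() const { return stops_; }

private:
  // Idempotent: a second start while active is a no-op.
  void startLocked() {
    if (active_) return;
    active_ = true;
    ++starts_;  // real code would start PC sampling here
  }
  // Idempotent: a stop while inactive is a no-op.
  void stopLocked() {
    if (!active_) return;
    active_ = false;
    ++stops_;  // real code would stop sampling and drain data here
  }

  double probability_;
  double interval_;
  std::mt19937 rng_;
  std::uniform_real_distribution<double> dist_{0.0, 1.0};
  std::mutex mu_;
  bool first_ = true;
  bool active_ = false;
  double windowStart_ = 0.0;
  int starts_ = 0;
  int stops_ = 0;
};
```

With probability 1 every window samples, which matches the reason test-pc-real.sh sets probability=1 for reliable hits. With probability 0 sampling never starts, so no kernels would be serialized.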

USDT headers now live in parca-dev/usdt, and the CUPTI BPF headers now live in this project, so we no longer need to vendor otel.

gnurizen · 2026-04-09T14:17:33Z

I'm gonna squash first and redo this

gnurizen and others added 14 commits March 23, 2026 16:42

Update proton callback API names for upstream sync

5951652

Update proton submodule to latest upstream sync

5fbd8eb

Various fixes to make arm64 work

59e0b35

And make amd64 compile too

73410e7

Small cleanups/formatting

16a9e54

Shorten names

8b50a77

Checkpoint PC sampling tweaking

d32344d

Stall reason map handling, prepping for batched pc samples

f87ed73

Flush out cubin processing, sass lookup and pc sampling probe batching

6496aa7

Cleanup related to usdt/cupti extraction

eece2f4

gnurizen closed this Apr 9, 2026

gnurizen wants to merge 14 commits into main from proton
